Modifying spoken commands

ABSTRACT

A method includes obtaining, at a first conference endpoint device, spoken command data representing a spoken command detected by the first conference endpoint device during a teleconference between the first conference endpoint device and a second conference endpoint device. The method further includes generating modified spoken command data by inserting a spoken phrase into the spoken command. The method further includes transmitting the modified spoken command data to a natural language service.

TECHNICAL FIELD

The present disclosure relates generally to detecting and then modifying spoken commands.

BACKGROUND

Speech recognition systems are becoming an increasingly popular means for users to interact with computing devices. A variety of speech recognition services enable users to control such computing devices and gain information without the need for a visual user interface, buttons, or other controls. To illustrate, a speech recognition service can change the channel on a television, control lights or doors, look up news, or perform a variety of other tasks based on detected speech. These speech recognition services are often responsive to a ‘wake word’ or phrase that indicates to the speech recognition service that a spoken command may follow. Further, these speech recognition services are often responsive to phrases that indicate the speech recognition service is to interact with a third party service. In an illustrative example, a speech recognition system is configured to search an incoming audio stream for the wake-up phrase, and, in response to detecting the wake-up phrase, the speech recognition system begins to respond to spoken commands included in the audio stream. In response to determining that a particular spoken command includes words associated with a third party service, the speech recognition service is configured to interact with the third party service.

Unfortunately, users may forget to use the wake-up phrase prior to issuing a spoken command. In such cases, the speech recognition system ignores the user's spoken command. Further, users may forget to say the words associated with the third party service when trying to interact with the third party service. In such cases, the speech recognition system attempts to process the user's spoken command without interacting with the third party service.

SUMMARY

Systems and methods according to the disclosure enable a communication device to modify a spoken command. In some examples, modifying a spoken command includes adding a wake-up phrase to the spoken command prior to transmitting the spoken command to a speech recognition service. In some examples, modifying the spoken command includes adding a phrase associated with a third party service to the spoken command prior to transmitting the spoken command to the speech recognition service. Accordingly, the systems and methods enable a user to issue the spoken command to the speech recognition service without saying a wake-up word or phrase associated with the speech recognition service. Further, the systems and methods enable the user to issue the spoken command to a third party service without saying a phrase or word associated with the third party service.

A method includes obtaining, at a first conference endpoint device, spoken command data representing a spoken command detected by the first conference endpoint device during a teleconference between the first conference endpoint device and a second conference endpoint device. The method further includes generating modified spoken command data by inserting a spoken phrase into the spoken command. The method further includes transmitting the modified spoken command data to a natural language service.

A computer readable storage medium stores instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising obtaining, at a first conference endpoint device, spoken command data representing a spoken command detected by the first conference endpoint device during a teleconference between the first conference endpoint device and a second conference endpoint device. The operations further include generating modified spoken command data by inserting a spoken phrase into the spoken command. The operations further include transmitting the modified spoken command data to a natural language service.

An apparatus includes one or more processors and a memory storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations. The operations include obtaining, at a first conference endpoint device, spoken command data representing a spoken command detected by the first conference endpoint device during a teleconference between the first conference endpoint device and a second conference endpoint device. The operations further include generating modified spoken command data by inserting a spoken phrase into the spoken command. The operations further include transmitting the modified spoken command data to a natural language service.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments described herein are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar features. It should be understood that the full scope of the inventions disclosed herein is not limited to the precise arrangements, dimensions, and instruments shown. Furthermore, in the drawings, some conventional details have been omitted so as not to obscure the inventive concepts described herein.

FIG. 1 is a diagram illustrating a communication device for modifying spoken commands.

FIG. 2 is a diagram illustrating the communication device modifying the spoken command by inserting a wake-up phrase.

FIG. 3 is a diagram illustrating the communication device modifying the spoken command by inserting a third party command phrase.

FIG. 4 is a diagram illustrating the communication device modifying the spoken command by inserting the third party command phrase and transmitting the spoken command and the modified spoken command to a natural language service.

FIG. 5 is a diagram illustrating the communication device modifying the spoken command by expanding or replacing a phrase included in the spoken command.

FIG. 6 is a diagram illustrating the communication device modifying the spoken command by removing a first wake-up phrase and inserting a second wake-up phrase.

FIG. 7 is a diagram illustrating the communication device generating more than one modified version of the spoken command, where each modified version includes a different wake-up phrase.

FIG. 8 is a flowchart of a method for modifying spoken commands.

FIG. 9 illustrates a computing device corresponding to a communication device and operable to perform one or more methods disclosed herein.

DETAILED DESCRIPTION

Reference to the drawings illustrating various views of exemplary embodiments is now made. In the following description, numerous specific details are set forth, such as specific configurations, methods, etc., in order to provide a thorough understanding of the embodiments. At least one of the described embodiments is practicable without one or more of these specific details, or in combination with other known methods and configurations. In other instances, well-known processes and techniques have not been described in particular detail to avoid obscuring the embodiments. Reference throughout this specification to “one embodiment,” “an embodiment,” “another embodiment,” “other embodiments,” “some embodiments,” and their variations means that a particular feature, structure, configuration, or characteristic described in connection with the embodiment is included in at least one implementation. Thus, the appearances of the phrase “in one embodiment,” “in an embodiment,” “in another embodiment,” “in other embodiments,” “in some embodiments,” or their variations in various places throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, configurations, or characteristics are combinable in any suitable manner in one or more embodiments. In the drawings and the description of the drawings herein, certain terminology is used for convenience only and is not to be taken as limiting the embodiments of the present disclosure. Furthermore, in the drawings and the description below, like numerals indicate like elements throughout.

In the following description and claims, the terms “coupled” and “connected,” along with their derivatives, are not intended as synonyms for each other. “Coupled” is used to indicate that two or more elements or components can directly or indirectly communicate with each other. “Connected” is used to indicate that two or more elements or components are directly linked with each other.

Any marks referenced herein are by way of example and shall not be construed as descriptive or as limiting the scope of the embodiments described herein to material associated only with such marks.

The present disclosure enables one of skill in the art to provide a system to generate modified spoken command data by inserting one or more spoken phrases into a spoken command. In a particular example, the system detects the spoken command “play rock music.” In response to the spoken command, the system generates an audio waveform representing the phrase “play rock music” and modifies the waveform by adding another phrase, such as “Wake up!” Accordingly, the system generates a modified audio waveform representing the phrase “Wake up! Play rock music.” In some implementations, the system is further configured to generate the modified spoken command data by removing one or more phrases from the representation of the spoken command. Referring back to the previous example, the system may delete the word “music” from the audio waveform. Accordingly, the modified audio waveform may represent the phrase “Wake up! Play rock.” The system transmits the modified spoken command data to a natural language service for processing.
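
The waveform splice described above may be expressed compactly in code. The following Python fragment is a minimal, non-limiting sketch assuming mono PCM audio held in NumPy arrays; the helper names and the caller-supplied word boundary are illustrative assumptions, as the disclosure leaves the waveform-editing implementation open.

    import numpy as np

    def insert_spoken_phrase(command_audio: np.ndarray,
                             phrase_audio: np.ndarray,
                             prepend: bool = True) -> np.ndarray:
        """Return a modified waveform with the stored phrase spliced in."""
        parts = ([phrase_audio, command_audio] if prepend
                 else [command_audio, phrase_audio])
        return np.concatenate(parts)

    def remove_trailing_word(command_audio: np.ndarray,
                             word_sample_count: int) -> np.ndarray:
        """Delete a trailing word (e.g., "music") by trimming its samples."""
        return command_audio[:-word_sample_count]

For instance, insert_spoken_phrase(command, wake_up_clip) would yield a waveform representing “Wake up! Play rock music” when the inputs represent “Play rock music” and “Wake up!”, respectively.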

FIG. 1 illustrates a communication device configured to modify a spoken command. Modifying a spoken command may include adding one or more words to and/or deleting one or more words from the spoken command. In particular, the communication device of FIG. 1 generates modified spoken command data (e.g., a waveform or other representation) based on stored data that represents a spoken phrase to be added to the spoken command.

FIG. 2 illustrates an example of the communication device modifying the spoken command by adding a wake-up phrase to the spoken command. FIG. 3 illustrates an example of the communication device modifying the spoken command by adding a third party command phrase to the spoken command. FIG. 4 illustrates an example of the communication device transmitting both the modified spoken command and the spoken command to the natural language service. FIG. 5 illustrates an example of the communication device modifying the spoken command by expanding or replacing a phrase included in the spoken command. FIG. 6 illustrates an example of the communication device modifying the spoken command by replacing a first wake-up phrase included in the spoken command with a second wake-up phrase. FIG. 7 illustrates an example of the communication device transmitting different modified versions of the spoken command to distinct natural language services. Each modified version of the spoken command includes a wake-up phrase associated with the corresponding destination natural language service.

Referring to FIG. 1, a diagram 100 illustrating a communication device 102 for modifying spoken commands is shown. In some implementations, the communication device 102 corresponds to a teleconference endpoint device configured to facilitate audio and/or video communication with other teleconference endpoint devices. In other examples, the communication device 102 corresponds to a mobile phone or to another type of computing device configured to receive and process speech.

In the example illustrated in FIG. 1, the communication device 102 includes a sound sensor 130, a memory 132, a processor 134, and an audio output device 136. In some implementations, the communication device 102 includes additional components other than those illustrated. The sound sensor 130 includes a microphone (e.g., a condenser microphone, a dynamic microphone, or any other type of microphone) and an analog to digital converter (A/D). In some examples, the sound sensor 130 includes a plurality of microphones and/or a plurality of A/Ds. The sound sensor 130 is configured to generate sound data based on an acoustic signal detected by the sound sensor 130.

The processor 134 corresponds to a digital signal processor (DSP), a central processing unit (CPU), or another type of processor. In some implementations, the processor 134 corresponds to a plurality of processor devices. In the illustrative example of FIG. 1, the memory 132 stores software 182 and includes random access memory 180. The communication device 102 is configured to load the software 182 into the random access memory 180 to be executed by the processor 134. The processor 134 is configured to execute the software 182 to perform one or more operations on the data output by the sound sensor 130. In some examples, the memory 132 includes a solid state device, an additional random access memory device, a disk drive, another type of memory, or a combination thereof in addition to the random access memory 180. In some implementations, the memory 132 corresponds to a plurality of memory devices. The processor 134 is further configured to provide output audio data to the audio output device 136. The output audio data may be based on data generated within the communication device 102, data received from another device, or a combination thereof.

The audio output device 136 includes a speaker and a digital to analog converter (D/A). In some examples, the audio output device 136 includes a plurality of speakers and/or a plurality of D/As. The audio output device 136 is configured to generate an output acoustic signal based on the output audio data received from the processor 134.

In operation, the sound sensor 130 detects a spoken command 140. In a particular use case, a user speaks the spoken command 140 during a teleconference facilitated by the communication device 102. A particular example of the spoken command is “What's the weather like?” In response to detecting the spoken command 140, the sound sensor 130 generates spoken command data 142 representing the spoken command 140.

The processor 134 generates modified spoken command data 144 based on the spoken command data 142 and stored data 160 stored in the memory 132. Generating the modified spoken command data 144 may include deleting one or more words from the spoken command 140, adding one or more words to the spoken command 140, or a combination thereof. In some examples, the stored data 160 represents an audio clip of a spoken phrase (e.g., “Wake up!”), and the processor 134 generates the modified spoken command data 144 by inserting the spoken phrase into the spoken command 140. In an illustrative example, the modified spoken command data 144 represents an audio clip of the spoken phrase “Wake up! What's the weather like?” In some examples, generation of the modified spoken command data 144 includes removing a word or a phrase from the spoken command 140. The processor 134 initiates transmission of the modified spoken command data 144 to a natural language service (not shown). In some examples, the natural language service is external to the communication device 102. In other examples, the natural language service is internal to the communication device 102.

As illustrated in FIG. 1, the communication device 102 receives response data 146 from a natural language service. The response data 146 represents a response to the modified spoken command data 144. The response data 146 represents audio, video, text, or a combination thereof. In a particular illustrative example, the response data 146 corresponds to an audio clip of the phrase “The temperature in your area is 76 degrees, and there is a 20% chance of rain.” In another example, the response corresponds to a text and/or audio error message (e.g., “Your command was not recognized”).

In the illustrated example, the processor 134 passes the response data 146 to the audio output device 136. In other examples, the processor 134 alters the response data 146 before sending the response data 146 to the audio output device 136. For example, the processor 134 may add words or phrases to and/or remove words or phrases from a response phrase represented by the response data 146. Based on the response data 146 (or a modified version thereof), the audio output device 136 generates response audio 150. For example, the audio output device 136 may generate an acoustic signal corresponding to the spoken phrase “The temperature in your area is 76 degrees, and there is a 20% chance of rain.”

Accordingly, in a particular use case, the communication device 102 detects the spoken command “What is the weather like?” (e.g., the spoken command 140) and generates the spoken command data 142 accordingly. The processor 134 modifies the spoken command data 142 by adding the wake-up phrase “Wake up!” to the spoken command. Accordingly, in the particular use case, the modified spoken command data 144 corresponds to the phrase “Wake up! What is the weather like?” The processor 134 initiates transmission of the modified spoken command data 144 to a natural language service, which processes the phrase “Wake up! What is the weather like?” and generates the response data 146 accordingly. The response data 146 corresponds to the phrase “The temperature in your area is 76 degrees, and there is a 20% chance of rain.” The processor 134 sends the response data 146 to the audio output device 136, which outputs the phrase “The temperature in your area is 76 degrees, and there is a 20% chance of rain,” as the response audio 150.

Thus, the communication device 102 modifies spoken command data before transmitting the modified spoken command data to a natural language service. Such modification provides a variety of benefits, as explained further below with reference to FIGS. 2-7.

Referring to FIG. 2, a diagram 200 illustrating a use case in which the communication device 102 adds a wake-up phrase to a spoken command is shown. In particular, the communication device 102 generates the modified spoken command data 144 by inserting a wake-up phrase 222 into the spoken command 140 before transmitting the modified spoken command data 144 to a natural language service 204. In the example of FIG. 2, the wake-up phrase 222 corresponds to the stored data 160 of FIG. 1. In some implementations, the communication device 102 (i.e., the processor 134) generates the modified spoken command data 144 in response to determining that the spoken command 140 is to be processed by the natural language service 204 but does not include the wake-up phrase 222. The natural language service 204 processes spoken commands conditionally based on detecting the wake-up phrase 222. For example, the natural language service 204 may parse a speech stream to identify and process a spoken command that occurs in the speech stream after an occurrence of the wake-up phrase 222. In some implementations, the natural language service 204 corresponds to one or more devices arranged in a cloud architecture associated with providing the natural language service. For example, a first device of the natural language service 204 may transmit the spoken command (or a representation thereof) to another cloud based device for processing in response to detecting the wake-up phrase 222.
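
One plausible realization of this conditional generation is sketched below in Python. The detector and transport interfaces (contains_wake_up, send) are assumptions introduced for illustration; the disclosure does not specify how the presence of the wake-up phrase is detected.

    import numpy as np

    def handle_spoken_command(command_audio, wake_up_clip, detector, transport):
        """Prepend the stored wake-up clip (cf. stored data 160) only when absent."""
        if not detector.contains_wake_up(command_audio):
            command_audio = np.concatenate([wake_up_clip, command_audio])
        transport.send(command_audio)  # to the natural language service 204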

Because the communication device 102 (i.e., the processor 134) inserts the wake-up phrase 222 into the spoken command 140 to generate the modified spoken command data 144 before transmitting the modified spoken command data 144 to the natural language service 204, the natural language service 204 detects the wake-up phrase 222 and provides the response data 146 accordingly. Thus, the communication device 102 enables a user of the communication device 102 to issue effective spoken commands to the natural language service 204 without uttering the wake-up phrase 222, even though the natural language service 204 processes spoken commands conditionally based on detecting the wake-up phrase 222.

The communication device 102 may select the natural language service 204 to process the spoken command 140 based on one or more factors. In some implementations, the one or more factors include costs associated with using the natural language service 204, a policy associated with the communication device 102, historical scores associated with the natural language service 204 processing particular commands, or a combination thereof. The communication device 102 stores or has access to data indicating the costs, the policy, the historical scores, or a combination thereof. An example of a policy is “use natural language service X for spoken command Y.” The historical scores may be based on user feedback received at the communication device 102 and/or other communication devices. For example, the communication device 102 may receive user feedback indicating the user's satisfaction with the response audio 150, and the communication device 102 may generate a historical score associated with the natural language service 204 processing the spoken command 140. This historical score may be used in the future by the communication device 102 to determine how to process future spoken commands.
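
A scoring rule combining these factors might look like the following sketch. The Service record, the cost weight, and the neutral default feedback score are assumptions chosen for illustration; the disclosure does not prescribe a particular selection formula.

    from dataclasses import dataclass

    @dataclass
    class Service:
        name: str
        cost_per_request: float

    def select_service(command, services, policy, history, cost_weight=0.1):
        """Pick a natural language service using policy, history, and cost."""
        if command in policy:  # e.g., "use natural language service X for spoken command Y"
            return next(s for s in services if s.name == policy[command])
        def score(service):
            feedback = history.get((service.name, command), 0.5)  # neutral default
            return feedback - cost_weight * service.cost_per_request
        return max(services, key=score)

Under this sketch, a policy entry always wins; otherwise the service with the best feedback-minus-cost score is chosen.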

Referring to FIG. 3, a diagram 300 illustrating a use case in which the communication device 102 generates the modified spoken command data 144 by inserting a third party command phrase 322 into the spoken command 140 before transmitting the modified spoken command data 144 to the natural language service 204 is shown. The third party command phrase 322 corresponds to the stored data 160 of FIG. 1. In the example of FIG. 3, the communication device 102 is configured to selectively generate the modified spoken command data 144 in response to determining that the spoken command 140 is associated with a third party service 304. For example, the communication device 102 may maintain a data structure (e.g., a table) that indicates which spoken commands are associated with third party services and, in response to determining that a detected spoken command is associated with a particular third party service, insert a corresponding third party command phrase into the spoken command. In some implementations, the third party service 304 corresponds to an application executed by the natural language service 204 rather than to a separate device. Similarly, the natural language service 204 may correspond to an application executed by the communication device 102.
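
The table-driven association might be realized as in the following sketch, which operates on a text transcript for readability (the device itself splices audio, as discussed above); the table entries are illustrative assumptions.

    THIRD_PARTY_COMMANDS = {
        "play music": "on Music Application 1",
        "read the news": "on News Application 1",
    }

    def add_third_party_phrase(command_text: str) -> str:
        """Append the command phrase of the associated third party service."""
        for trigger, phrase in THIRD_PARTY_COMMANDS.items():
            if trigger in command_text.lower():
                return f"{command_text} {phrase}"
        return command_text  # no associated third party service

For example, add_third_party_phrase("Play music") would return "Play music on Music Application 1".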

The natural language service 204 is configured to communicate third party data 346 to the third party service 304 in response to the third party command phrase 322. The third party service 304 provides one or more services that the natural language service 204 may be unable or not configured to provide inherently. Examples of the one or more services include a news service, a teleconference service, a music service, or any other type of service. The third party data 346 is based on the modified spoken command data 144. In a particular implementation, the third party data 346 represents a transcript of the modified spoken command, or a portion thereof, as represented by the modified spoken command data 144. In other examples, the third party data 346 corresponds to an application programming interface (API) call selected by the natural language service 204 based on the modified spoken command data 144.

In the illustrated example, the third party service 304 provides the response data 146 to the natural language service 204, and the natural language service 204 provides the response data 146 to the communication device 102. In other examples, the third party service 304 provides output data to the natural language service 204 that the natural language service 204 uses to generate the response data 146. For example, the third party service 304 may provide a transcript of the response to the natural language service 204, and the natural language service 204 may generate the response data 146 based on the transcript of the response. In still other examples, the third party service 304 transmits the response data 146 directly to the communication device 102.

Because the communication device 102 (i.e., the processor 134) inserts the third party command phrase 322 into the spoken command 140 to generate the modified spoken command data 144 before transmitting the modified spoken command data 144 to the natural language service 204, the natural language service 204 detects the third party command phrase 322 and interacts with the third party service 304 to provide the response data 146 accordingly. Thus, a user of the communication device 102 can access third party functions via the natural language service 204 without uttering the third party command phrase 322 associated with the third party service.

In a particular example, the communication device 102 detects the phrase “Play music” as the spoken command 140. In response to associating the command “Play music” with Music Application 1 (e.g., the third party service 304), the communication device 102 adds the phrase “on Music Application 1” (e.g., the third party command phrase 322) to the spoken command 140. Thus, the modified spoken command data 144 represents the phrase “Play music on Music Application 1.” The natural language service 204 parses the modified command and determines that the modified command is to be resolved by the Music Application 1 service based on the phrase “on Music Application 1.” Accordingly, the natural language service 204 transmits a transcript of the command “Play music on Music Application 1” to the Music Application 1 service. In response to the transcript, the Music Application 1 service begins to stream music to the communication device 102 via the natural language service 204. In other examples, the music stream may be established directly between the Music Application 1 service and the communication device 102. Accordingly, a user may interact with the Music Application 1 service via the communication device 102 and the natural language service 204 without uttering the phrase “on Music Application 1” that is associated with triggering the natural language service 204 to interact with the Music Application 1 service.

Referring to FIG. 4, a diagram 400 illustrating a use case in which the communication device 102 transmits both the modified spoken command data 144 (including the third party command phrase 322 depicted in FIG. 3) and the spoken command data 142 to the natural language service 204 is shown. As described above, the natural language service 204 transmits the third party data 346 to the third party service 304 based on the modified spoken command data 144 and receives the response data 146 from the third party service 304. The natural language service 204 transmits the response data 146 to the communication device 102. In addition, the natural language service 204 processes the spoken command data 142 according to the natural language service provided by the natural language service 204. Based on the spoken command data 142, the natural language service 204 generates and transmits additional response data 404 to the communication device 102.

In some examples, the communication device 102 determines which of the response data 404, 146 to use to generate output audio based on content of the response data 146 received from the third party service 304. In cases where the response data 146 from the third party service 304 corresponds to an error, the communication device 102 outputs audio based on the additional response data 404 from the natural language service 204. In the illustrated example, the communication device 102 outputs the response audio 150 based on the response data 146 from the third party service 304 (e.g., because the response data 146 does not correspond to an error message). In other examples, the communication device 102 generates the response audio 150 based on both the response data 146 and the additional response data 404. Thus, in contrast to FIG. 3, in which the communication device 102 selectively modifies spoken commands, the example of FIG. 4 illustrates a use case in which the communication device 102 modifies each detected spoken command, transmits the modified spoken command and the original spoken command to the natural language service, and selectively generates audio output.
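
The arbitration rule reduces to a few lines, sketched below; the dictionary shape with an "error" key is an assumption about how responses might be represented, not part of the disclosure.

    def choose_response(third_party_response: dict, additional_response: dict) -> dict:
        """Fall back to the natural language service's own answer on error."""
        if third_party_response.get("error"):
            return additional_response
        return third_party_response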

Referring to FIG. 5, a diagram 500 illustrating a use case in which the communication device 102 modifies the spoken command 140 by expanding and/or replacing a phrase 520 included in the spoken command is shown. In the illustrated example, the communication device 102 replaces the phrase 520 with an expanded phrase 540 in response to determining that the phrase 520 is associated with the expanded phrase 540. In some embodiments, the communication device 102 stores a mapping of phrases to expanded phrases. The communication device 102 may generate the mapping based on input received from a user.

The modified spoken command data 144 represents the spoken command 140 with the phrase 520 replaced by the expanded phrase 540. Accordingly, a user may interact with the natural language service 204 using the expanded phrase 540 by uttering the phrase 520.

In an illustrative example, a user accesses a configuration setting of the communication device 102 and maps the phrase “Play music with setup A” to the expanded phrase “Play music in living room at volume level 5 via Music service 1.” During use, in response to detecting the phrase “Play music with setup A,” the communication device 102 expands the phrase to “Play music in living room at volume level 5 via Music service 1” and transmits the expanded phrase to the natural language service 204. Accordingly, the user may initiate a relatively lengthy spoken command by uttering a relatively shorter phrase.
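
The user-configured mapping might be applied as in the following sketch; matching is case-insensitive, and the single table entry mirrors the example above. The text-domain representation is an assumption made for readability.

    import re

    EXPANSIONS = {
        "play music with setup a":
            "Play music in living room at volume level 5 via Music service 1",
    }

    def expand_shortcuts(command_text: str) -> str:
        """Replace each configured shortcut with its expanded phrase."""
        for shortcut, expanded in EXPANSIONS.items():
            command_text = re.sub(re.escape(shortcut), expanded,
                                  command_text, flags=re.IGNORECASE)
        return command_text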

Referring to FIG. 6, a diagram 600 illustrating the communication device generating the modified spoken command data 144 by removing a first wake-up phrase 620 from the spoken command 140 and inserting a second wake-up phrase 622 into the spoken command 140 is shown. The communication device 102 then transmits the modified spoken command data 144 to a second natural language service 604 associated with the second wake-up phrase 622. The communication device 102 receives the response data 146 from the second natural language service 604.

The communication device 102 replaces the first wake-up phrase 620 with the second wake-up phrase 622 in response to determining to use the second natural language service 604 rather than the first natural language service 602 to process the spoken command 140. The communication device 102 determines which of the natural language services 602, 604 to use based on one or more factors, as explained above. While only two natural language services 602, 604 are illustrated, in some examples the communication device 102 selects from more than two natural language services.

In an illustrative example of the use case illustrated by FIG. 6, the communication device 102 detects the spoken command “Wake up service 1, play music.” The phrase “Wake up service 1” is the first wake-up phrase 620 and is associated with the first natural language service 602. In response to determining that the second natural language service 604 is better suited to processing the phrase “play music” (e.g., due to cost, historical accuracy, etc.), the communication device 102 changes the spoken command to “Wake up service 2, play music” and transmits the changed spoken command to the second natural language service 604. Thus, a spoken command may be routed to a more effective natural language service than the one selected by the user.
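
Rerouting can be sketched as stripping the detected wake-up phrase and prepending the phrase of the chosen service. The table below and the caller-supplied chosen_service are assumptions; a chooser such as the select_service() sketch above could supply chosen_service.

    WAKE_UP_BY_SERVICE = {
        "service 1": "Wake up service 1",
        "service 2": "Wake up service 2",
    }

    def reroute(command_text: str, chosen_service: str) -> str:
        """Replace any leading wake-up phrase with the chosen service's phrase."""
        for phrase in WAKE_UP_BY_SERVICE.values():
            if command_text.lower().startswith(phrase.lower()):
                command_text = command_text[len(phrase):].lstrip(" ,")
                break
        return f"{WAKE_UP_BY_SERVICE[chosen_service]}, {command_text}"

For instance, reroute("Wake up service 1, play music", "service 2") yields "Wake up service 2, play music".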

Referring to FIG. 7, a diagram 700 illustrating the communication device 102 transmitting modified spoken command data to more than one natural language service rather than selecting one natural language service is shown. As illustrated, the communication device 102 inserts the first wake-up phrase 620 into the spoken command 140 to generate the modified spoken command data 144. The communication device 102 further generates additional modified spoken command data 744 by inserting the second wake-up phrase 622 into the spoken command 140. The communication device 102 transmits the modified spoken command data 144 to the first natural language service 602 and transmits the additional modified spoken command data 744 to the second natural language service 604.

The first natural language service 602 processes the modified spoken command data 144 and generates the response data 146. Similarly, the second natural language service 604 processes the additional modified spoken command data 744 and generates additional response data 746. The first natural language service 602 transmits the response data 146 to the communication device 102, and the second natural language service 604 transmits the additional response data 746 to the communication device 102.

In the illustrated example, the communication device 102 generates the response audio 150 based on the response data 146. In some implementations, the communication device 102 selects which of the response data 146, 746 to use to generate the response audio 150 based on content of the responses, based on a policy, based on historical scores associated with the natural language services 602, 604 processing particular commands, or a combination thereof. To illustrate, the communication device 102 may select a response that does not indicate an error. An example of a policy is “select the response from service A if the response does not indicate an error.” In some implementations, the communication device 102 generates the response audio 150 based on more than one response. For example, the communication device 102 may generate the response audio 150 based on both the response data 146 and the additional response data 746. Thus, FIG. 7 illustrates an example in which the communication device 102 solicits responses to a spoken command from more than one natural language service and selectively outputs audio based on the responses.
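
The fan-out and selection might be sketched as below. Each service is represented by its wake-up phrase and a send callable standing in for whatever transport the device uses; the dict-shaped responses and the first-non-error policy are illustrative assumptions.

    from concurrent.futures import ThreadPoolExecutor

    def fan_out(command_text, services):
        """services: name -> (wake-up phrase, send callable returning a dict)."""
        with ThreadPoolExecutor() as pool:
            futures = {name: pool.submit(send, f"{wake_up}, {command_text}")
                       for name, (wake_up, send) in services.items()}
            responses = {name: future.result() for name, future in futures.items()}
        for name, response in responses.items():
            if not response.get("error"):
                return name, response  # e.g., policy: first non-error response
        return None, None              # every service reported an error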

The examples illustrated in FIGS. 2-7 may be used in combination. In a particular example, the communication device 102 generates the modified spoken command data 144 by inserting the wake-up phrase 222 and the third party command phrase 322 into the spoken command 140. Thus, FIGS. 2-7 illustrate various techniques for modifying a spoken command that may be used individually or in combination.

Referring to FIG. 8, an illustration of a method 800 for modifying spoken commands is shown. In particular embodiments, the method 800 is performed by a communication device, such as the communication device 102. The method 800 includes detecting a spoken command, at 802. For example, the sound sensor 130 of the communication device 102 detects the spoken command 140. The method 800 further includes generating spoken command data, at 804. For example, the sound sensor 130 generates the spoken command data 142. The method 800 further includes determining whether the spoken command includes a wake-up phrase, at 806. For example, the communication device 102 determines whether the spoken command data 142 includes a wake-up phrase associated with a natural language service. The method 800 further includes determining whether a wake-up phrase is to be added in response to determining that the spoken command does not include a wake-up phrase, at 810. If a wake-up phrase is to be added, the method includes adding a wake-up phrase, at 811. For example, as illustrated in FIG. 2, the communication device 102 generates the modified spoken command data 144 by inserting the wake-up phrase 222 into the spoken command 140.

The method 800 further includes, in response to determining that the spoken command includes a wake-up phrase, determining whether the wake-up phrase should be replaced, at 808. For example, the communication device 102 determines whether to replace the first wake-up phrase 620 with the second wake-up phrase 622, as illustrated in FIG. 6.

The method 800 further includes, in response to determining that the wake-up phrase should be replaced, generating modified spoken command data by replacing the wake-up phrase, at 812. For example, the communication device 102 generates the modified spoken command data 144 by replacing the first wake-up phrase 620 with the second wake-up phrase 622, as illustrated in FIG. 6.

The method 800 further includes, after generating modified spoken command data by adding or replacing a wake-up phrase or after determining that a wake-up phrase should not be added or replaced, determining whether the spoken command includes a third party command phrase, at 814. For example, the communication device 102 determines whether the spoken command data 142 includes a third party command phrase, as illustrated in FIG. 3.

The method 800 further includes, in response to determining that the spoken command does not include a third party command phrase, determining whether a third party command phrase should be added to the spoken command, at 816. For example, the communication device 102 determines whether to add the third party command phrase 322 to the spoken command data 142, as illustrated in FIG. 3, based on data indicating that the spoken command 140 is associated with the third party service 304. Alternatively, the communication device 102 determines to add the third party command phrase 322 to each spoken command, as illustrated in FIG. 4.

The method 800 further includes, in response to determining that a third party command phrase should be added, generating modified spoken command data that includes the third party command phrase, at 820. For example, the communication device 102 adds the third party command phrase to the modified spoken command data 144, as shown in FIGS. 3 and 4.

The method 800 further includes, in response to determining that the spoken command includes a third party command phrase, that no third party command phrase should be added to the spoken command, or that modified spoken command data that includes the third party command phrase has been generated, determining whether the spoken command and/or modified spoken command includes a phrase to expand, at 822. In response to determining that the spoken command and/or the modified spoken command includes a phrase to expand, the method 800 further includes generating modified spoken command data that includes an expanded phrase, at 824. For example, the communication device 102 may replace the phrase 520 in the spoken command 140 with the expanded phrase 540 in the modified spoken command data 144.

The method 800 further includes, in response to adding the expanded phrase or determining not to add the expanded phrase, transmitting the spoken command data and/or the modified spoken command data, at 826. For example, the communication device 102 transmits the modified spoken command data 144, the spoken command data 142, or a combination thereof to a natural language service. In some implementations, the communication device 102 further determines whether to send additional modified spoken command data to additional natural language services, as illustrated in FIG. 7.
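
Gathering the steps of the method 800 into one text-domain sketch may aid understanding; the device itself operates on audio data, and every table entry and name below is an illustrative assumption. The step numbers of FIG. 8 appear as comments.

    import re

    WAKE_UPS = ("wake up service 1", "wake up service 2")
    DEFAULT_WAKE_UP = "Wake up service 1"
    THIRD_PARTY = {"play rock": "on Music Application 1"}
    EXPANSIONS = {"with setup a": "in living room at volume level 5"}

    def modify_spoken_command(command_text, replacement_wake_up=None):
        text = command_text.strip()                    # 802/804: detected command
        found = next((w for w in WAKE_UPS if text.lower().startswith(w)), None)
        if found is None:                              # 806, 810, 811: add wake-up
            text = f"{DEFAULT_WAKE_UP}, {text}"
        elif replacement_wake_up is not None:          # 808, 812: replace wake-up
            text = f"{replacement_wake_up}, {text[len(found):].lstrip(' ,')}"
        for trigger, phrase in THIRD_PARTY.items():    # 814, 816, 820: third party
            if trigger in text.lower() and phrase.lower() not in text.lower():
                text = f"{text} {phrase}"
        for shortcut, expanded in EXPANSIONS.items():  # 822, 824: expand shortcuts
            text = re.sub(re.escape(shortcut), expanded, text, flags=re.IGNORECASE)
        return text                                    # 826: ready to transmit

Under these example tables, modify_spoken_command("Play rock with setup A") returns "Wake up service 1, Play rock in living room at volume level 5 on Music Application 1".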

Thus, FIG. 8 illustrates an example of a method usable to modify a spoken command. A device operating according to the method described above enables a user to interact with a natural language service without speaking a wake-up phrase associated with the natural language service. The device operating according to the method further enables the user to access one or more third party services to process spoken commands that are not directly supported by the natural language service without speaking a command phrase associated with the third party service. In addition, the device operating according to the method enables the user to utter a relatively short phrase to activate a command associated with a relatively longer expanded phrase. Accordingly, a device operating according to the method of FIG. 8 may be more convenient to use as compared to other devices that interact with natural language services.

Referring now to FIG. 9, a block diagram illustrates a computing device 900 that is usable to implement the techniques described herein in accordance with one or more embodiments. For example, in some implementations, the computing device 900 corresponds to the communication device 102. As shown in FIG. 9, the computing device 900 can include one or more input/output devices, such as a network communication unit 908 that could include a wired communication component and/or a wireless communications component, which can be coupled to a processor element 902. The network communication unit 908 corresponds to one or more transceiver unit(s) that utilize one or more of a variety of standardized network protocols, such as Wi-Fi, Ethernet, TCP/IP, etc., to effect communications between devices.

The computing device 900 includes a processor element 902 that contains one or more hardware processors, where each hardware processor has a single or multiple processor cores. In one embodiment, the processor element 902 includes at least one shared cache that stores data (e.g., computing instructions) that is utilized by one or more other components of the processor element 902. In a particular example, the shared cache corresponds to locally cached data stored in a memory for faster access by components of the processor element 902. In one or more embodiments, the shared cache includes one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), or combinations thereof. Examples of processors include, but are not limited to, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a graphics processing unit (GPU), an application specific integrated circuit (ASIC), and a field-programmable gate array (FPGA). In some implementations, the processor element 902 corresponds to the processor 134.

FIG. 9 illustrates that a memory 904 is operatively coupled to the processor element 902. In some embodiments, the memory 904 corresponds to a non-transitory medium configured to store various types of data. In an illustrative example, the memory 904 includes one or more memory devices that comprise a non-volatile storage device and/or volatile memory. Examples of non-volatile storage devices include disk drives, optical drives, solid-state drives (SSDs), tape drives, flash memory, read only memory (ROM), and/or any other type of memory designed to maintain data for a duration of time after a power loss or shut down operation. An example of volatile memory is random access memory (RAM). In the illustrated example, the memory 904 stores modification instructions 912. The modification instructions 912 are executable by the processor element 902 to perform one or more of the operations or methods described with respect to FIGS. 1-8. In particular, the modification instructions 912 are executable by the processor element 902 to generate modified spoken command data by adding a spoken phrase to a spoken command.

Persons of ordinary skill in the art are aware that software programs may be developed, encoded, and compiled in a variety of computing languages for a variety of software platforms and/or operating systems and subsequently loaded and executed by the processor element 902. In one embodiment, the compiling process of the software program transforms program code written in a programming language to another computer language such that the processor element 902 is able to execute the programming code. For example, the compiling process of the software program may generate an executable program that provides encoded instructions (e.g., machine code instructions) for the processor element 902 to accomplish specific, non-generic, particular computing functions.

After the compiling process, the encoded instructions are then loaded as computer executable instructions or process steps to the processor element 902 from storage (e.g., the memory 904) and/or embedded within the processor element 902 (e.g., cache). The processor element 902 executes the stored instructions or process steps in order to perform operations or process steps to transform the computing device into a non-generic, particular, specially programmed machine or apparatus. Stored data, e.g., data stored by a storage device, can be accessed by the processor element 902 during the execution of computer executable instructions or process steps to instruct one or more components within the computing device 900.

In the example of FIG. 9, the computing device further includes a user interface 910, which can include a display, a positional input device (such as a mouse, touchpad, touchscreen, or the like), a keyboard, or other forms of user input and output devices. The user interface 910 can be coupled to the processor element 902. Other output devices that permit a user to program or otherwise use the computing device can be provided in addition to or as an alternative to the network communication unit 908. When the output device is or includes a display, the display can be implemented in various ways, including by a liquid crystal display (LCD), a cathode-ray tube (CRT), or a light emitting diode (LED) display, such as an OLED display. Some implementations of the computing device do not include the user interface 910.

The computing device 900 further includes a digital to analog converter (D/A) 921 coupled to the processor element 902 and to a speaker 922. In some implementations, the D/A 921 and the speaker 922 correspond to the audio output device 136. The computing device 900 further includes an analog to digital converter (A/D) 923 coupled to the processor element 902 and to a microphone 924. In some implementations, the A/D 923 and the microphone 924 correspond to the sound sensor 130. The microphone 924 and the A/D 923 are configured to generate a digital representation of a spoken command detected by the microphone 924 and to provide the digital representation to the processor element 902. The D/A 921 and the speaker 922 are configured to output an acoustic signal based on a digital representation of a response received from the processor element 902. It should be noted that, in some embodiments, the computing device 900 comprises other components, such as sensors and/or power sources, not explicitly shown in FIG. 9.

As discussed above, the systems and methods described with reference to FIGS. 1-9 enable a system to modify a spoken command by adding a spoken phrase to the spoken command such that a user is able to interact with a natural language service without speaking a wake-up phrase associated with the natural language service. Further, modification of the spoken command allows the user to engage one or more third party services to process spoken commands that are not directly supported by the natural language service without speaking a command phrase associated with the third party service. Accordingly, the device may be more convenient to use as compared to other devices that interact with natural language services.

In a first particular example, the computing device 900 corresponds to a smart speaker, such as an Amazon Echo® device (Amazon Echo is a registered trademark of Amazon Technologies, Inc. of Seattle, Wash.). The smart speaker device is configured to receive and respond to voice commands spoken by a user. In a second particular example, the computing device 900 corresponds to a different type of device executing an intelligent personal assistant service, such as Alexa® (Alexa is a registered trademark of Amazon Technologies, Inc. of Seattle, Wash.), that is responsive to voice commands. In particular use cases, the smart speaker modifies spoken commands prior to transmitting the spoken commands to a backend associated with the natural language service or to a third party service.

In a third particular example, the computing device 900 corresponds to a conference endpoint device (e.g., a video and/or voice conference device). The conference endpoint device is configured to exchange audio and/or video signals with another conference endpoint during a video or audio conference. The conference endpoint device is further configured to respond to voice commands using one or more natural language recognition services, such as Alexa®, Siri® (Siri is a registered trademark of Apple Inc. of Cupertino, Calif.), Cortana® (Cortana is a registered trademark of Microsoft Corporation of Redmond, Wash.), etc. The conference endpoint modifies detected spoken commands, as described herein, before transmitting the spoken commands to the natural language recognition service(s) for processing.

In a first use case of the third particular example, the conference endpoint detects that a user has spoken a command (e.g., “Play music I'll like.”) but has not spoken a wake-up phrase (e.g., “Alexa”) associated with a natural language recognition service. The conference endpoint modifies the spoken command by prepending the wake-up phrase and then transmits the modified spoken command (e.g., “Alexa, play music I'll like”) to the natural language recognition service. Accordingly, the natural language recognition service will detect the wake-up phrase and then process the spoken command. The conference endpoint receives a response to the spoken command from the natural language service and responds accordingly. To illustrate, the conference endpoint may output music received from the natural language recognition service in response to the spoken command.

In a second use case of the third particular example, the conference endpoint detects that a user has spoken a command (e.g., “Play music”) associated with a third party skill registered to the natural language service without speaking a phrase associated with activating the third party skill. For example, the user may say “play music” without saying “on Spotify®” (Spotify is a registered trademark of Spotify AB Corporation of Stockholm, Sweden). In response to detecting the spoken command, the conference endpoint modifies the spoken command by prepending or appending the phrase associated with activating the third party skill. The conference endpoint then transmits the modified spoken command (e.g., “Play music on Spotify”) to the natural language recognition service. Accordingly, the natural language recognition service can forward the modified spoken command to a service (e.g., Spotify) associated with the third party skill.

In a third use case of the third particular example, the conference endpoint modifies a spoken command for use with a third party skill but transmits both the original spoken command (e.g., “Play music”) and the modified spoken command (e.g., “Play music on Spotify”) to the natural language recognition service. The conference endpoint determines whether to output a response from the natural language recognition service or the third party service based on content of the responses. For example, the conference endpoint may play music from Amazon in cases where an error message is received from Spotify but may play music from Spotify in cases where music is received from Spotify.

In a fourth use case of the third particular example, the conference endpoint is configured to expand shortcut phrases before transmitting them to the natural language recognition service. For example, in response to receiving the command “Play music with setup A,” the conference endpoint may transmit “Play music in living room at volume level 5 via Spotify” to the natural language recognition service.

In a fifth use case of the third particular example, the conference endpoint modifies a spoken command by replacing a first wake-up phrase (e.g., “Alexa”) with a second wake-up phrase (e.g., “Cortana”). Accordingly, the conference endpoint can transmit the modified spoken command to a natural language recognition service that is more suitable for processing the spoken command. The conference endpoint may determine which natural language service is more suitable based on stored user preferences.

In a sixth use case of the third particular example, the conference endpoint generates a different version of a spoken command (e.g., “Play music”) for each of several natural language recognition services. For example, the conference endpoint may send “Alexa, play music” to one service and “Cortana, play music” to another service. The conference endpoint transmits the versions of the spoken command to the corresponding natural language recognition services in parallel. The conference endpoint may receive responses from each of the natural language recognition services and decide which one to output based on user preferences, content of the responses, or a combination thereof.

As illustrated by the various examples, the disclosed embodiments represent an improvement to user interfaces that operate on detected speech. In particular, the disclosed embodiments are more resilient to user error as compared to other systems because the disclosed embodiments insert a phrase inadvertently omitted from a spoken command. Further, the disclosed embodiments are more convenient to use because the length of spoken commands uttered by users may be reduced. Accordingly, the disclosed systems and methods represent an improvement to how computing devices provide user interfaces. In particular, the disclosed systems and methods represent an improvement to how computing devices process spoken commands to provide a user interface.

At least one embodiment is disclosed and variations, combinations, and/or modifications of the embodiment(s) and/or features of the embodiment(s) made by a person having ordinary skill in the art are within the scope of the disclosure. Alternative embodiments that result from combining, integrating, and/or omitting features of the embodiment(s) are also within the scope of the disclosure.

Use of the term “optionally” with respect to any element of a claim means that the element is required, or alternatively, the element is not required, both alternatives being within the scope of the claim. Use of broader terms such as comprises, includes, and having is understood to provide support for narrower terms such as consisting of, consisting essentially of, and comprised substantially of. Accordingly, the scope of protection is not limited by the description set out above but is defined by the claims that follow, that scope including all equivalents of the subject matter of the claims. Each and every claim is incorporated as further disclosure into the specification, and the claims are embodiment(s) of the present disclosure.

It is to be understood that the above description is intended to be illustrative, and not restrictive. For example, the above-described embodiments are useable in combination with each other. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention therefore should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. It should be noted that the discussion of any reference is not an admission that it is prior art to the present invention, especially any reference that has a publication date after the priority date of this application.

What is claimed is:
 1. A method comprising: obtaining, at a first conference endpoint device, spoken command data representing a spoken command detected by the first conference endpoint device during a teleconference between the first conference endpoint device and a second conference endpoint device; generating modified spoken command data by inserting a spoken phrase into the spoken command; and transmitting the modified spoken command data to a natural language service.
 2. The method of claim 1, wherein the spoken phrase corresponds to a wake-up phrase associated with the natural language service.
 3. The method of claim 2, wherein generating the modified spoken command data further includes removing a second wake-up phrase from the spoken command, the second wake-up phrase associated with a second natural language service.
 4. The method of claim 2, further comprising: generating additional modified spoken command data by inserting an additional spoken phrase into the spoken command; and transmitting the additional modified spoken command data to an additional natural language service.
 5. The method of claim 4, further comprising: receiving first response data from the natural language service; receiving second response data from the additional natural language service; and in response to the first response data indicating an error, outputting audio based on the second response data.
 6. The method of claim 4, further comprising: receiving first response data from the natural language service; receiving second response data from the additional natural language service; and outputting audio based on the first response data and the second response data.
 7. The method of claim 1, wherein the spoken phrase corresponds to a third party command phrase, and wherein the natural language service is configured to transmit data to a third party service responsive to the third party command phrase.
 8. The method of claim 1, further comprising transmitting the spoken command data to the natural language service.
 9. A computer readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: obtaining, at a first conference endpoint device, spoken command data representing a spoken command detected by the first conference endpoint device during a teleconference between the first conference endpoint device and a second conference endpoint device; generating modified spoken command data by inserting a spoken phrase into the spoken command; and transmitting the modified spoken command data to a natural language service.
 10. The computer readable storage medium of claim 9, wherein the spoken phrase corresponds to a wake-up phrase associated with the natural language service.
 11. The computer readable storage medium of claim 10, wherein generating the modified spoken command data further includes removing a second wake-up phrase from the spoken command, the second wake-up phrase associated with a second natural language service.
 12. The computer readable storage medium of claim 10, wherein the operations further comprise: generating additional modified spoken command data by inserting an additional spoken phrase into the spoken command; and transmitting the additional modified spoken command data to an additional natural language service.
 13. The computer readable storage medium of claim 12, wherein the operations further comprise: receiving first response data from the natural language service; receiving second response data from the additional natural language service; and in response to the first response data indicating an error, outputting audio based on the second response data.
 14. The computer readable storage medium of claim 12, wherein the operations further comprise: receiving first response data from the natural language service; receiving second response data from the additional natural language service; and outputting audio based on the first response data and the second response data.
 15. The computer readable storage medium of claim 9, wherein the spoken phrase corresponds to a third party command phrase, and wherein the natural language service is configured to transmit data to a third party service responsive to the third party command phrase.
 16. The computer readable storage medium of claim 9, wherein the operations further include transmitting the spoken command data to the natural language service.
 17. An apparatus comprising: one or more processors; and a memory storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: obtaining, at a first conference endpoint device, spoken command data representing a spoken command detected by the first conference endpoint device during a teleconference between the first conference endpoint device and a second conference endpoint device; generating modified spoken command data by inserting a spoken phrase into the spoken command; and transmitting the modified spoken command data to a natural language service.
 18. The apparatus of claim 17, wherein the spoken phrase corresponds to a wake-up phrase associated with the natural language service.
 19. The apparatus of claim 18, wherein generating the modified spoken command data further includes removing a second wake-up phrase from the spoken command, the second wake-up phrase associated with a second natural language service.
 20. The apparatus of claim 18, wherein the operations further comprise: generating additional modified spoken command data by inserting an additional spoken phrase into the spoken command; and transmitting the additional modified spoken command data to an additional natural language service.